Ecological niche models and species distributions

Showing some neat features of R!
Author
Affiliation
Published

April 15, 2024

Note

In this lab, we will explore some correlative models using species-level data (occurrence points) and environmental data at large spatial scales. We will download occurrence data for oak species distributed in the Americas from GBIF and environmental (climatic) data from WorldClim.

1 Set up your data and your working directory

To do this laboratory you will need to install the following packages:

Code
packages <- c("terra", "geodata", "sp", "spThin", "sf", "maps", "dismo", 
              "CoordinateCleaner", "rgbif", "countrycode", "rworldmap", 
              "Metrics", "ROCR", "ggraph", "plotly")
# Package vector names

You can use the function install.packages() to install the packages.

If you don’t want to install the packages one by one, you can use the next command.

Code
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())

if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages], dependencies = TRUE)
}

Although we installed several packages in the previous lines, two packages are missing: {torch} and {cito}. These two packages are necessary for modeling the environmental niches (ghosts) or Grinnellian niches under a deep learning framework; in other words, a method that teaches computers to process data in a way inspired by the human brain. {torch} is the main package and {cito} is a wrapper that facilitates the implementation of deep neural networks. We will install these packages separately because {torch} needs an additional step, after the package itself is installed, to download its backend libraries.

Let’s start with {torch}

Code
# check package 
if(!require('torch', quietly = TRUE)) install.packages('torch')
library('torch') 

#install torch
if(!torch_is_installed()) install_torch()

The last step is to install {cito}

Code
if (!("cito" %in% installed.packages())) {
    install.packages("cito")
}

Load the packages

Code
sapply(packages, require, character.only = TRUE)
Loading required package: terra
terra 1.7.39
Loading required package: geodata
Loading required package: sp
The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
which was just loaded, will retire in October 2023.
Please refer to R-spatial evolution reports for details, especially
https://r-spatial.org/r/2023/05/15/evolution4.html.
It may be desirable to make the sf package available;
package maintainers should consider adding sf to Suggests:.
The sp package is now running under evolution status 2
     (status 2 uses the sf package in place of rgdal)
Loading required package: spThin
Loading required package: spam
Spam version 2.9-1 (2022-08-07) is loaded.
Type 'help( Spam)' or 'demo( spam)' for a short introduction 
and overview of this package.
Help for individual functions is also obtained by adding the
suffix '.spam' to the function name, e.g. 'help( chol.spam)'.

Attaching package: 'spam'
The following objects are masked from 'package:base':

    backsolve, forwardsolve
Loading required package: grid

Attaching package: 'grid'
The following object is masked from 'package:terra':

    depth
Loading required package: fields
Loading required package: viridisLite

Try help(fields) to get started.

Attaching package: 'fields'
The following object is masked from 'package:geodata':

    world
The following object is masked from 'package:terra':

    describe
Loading required package: knitr

Attaching package: 'knitr'
The following object is masked from 'package:terra':

    spin
Loading required package: sf
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
Loading required package: maps
Loading required package: dismo
Loading required package: raster
Loading required package: CoordinateCleaner
Loading required package: rgbif

Attaching package: 'rgbif'
The following object is masked from 'package:torch':

    dataset
Loading required package: countrycode
Loading required package: rworldmap
Please note that 'maptools' will be retired during October 2023,
plan transition at your earliest convenience (see
https://r-spatial.org/r/2023/05/15/evolution4.html and earlier blogs
for guidance);some functionality will be moved to 'sp'.
 Checking rgeos availability: FALSE
### Welcome to rworldmap ###
For a short introduction type :      vignette('rworldmap')
Loading required package: Metrics
Loading required package: ROCR
Loading required package: ggraph
Loading required package: ggplot2

Attaching package: 'ggraph'
The following object is masked from 'package:sp':

    geometry
Loading required package: plotly

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:raster':

    select
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
            terra           geodata                sp            spThin 
             TRUE              TRUE              TRUE              TRUE 
               sf              maps             dismo CoordinateCleaner 
             TRUE              TRUE              TRUE              TRUE 
            rgbif       countrycode         rworldmap           Metrics 
             TRUE              TRUE              TRUE              TRUE 
             ROCR            ggraph            plotly 
             TRUE              TRUE              TRUE 
Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ lubridate 1.9.2     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::extract() masks raster::extract(), terra::extract()
✖ dplyr::filter()  masks plotly::filter(), stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ purrr::map()     masks maps::map()
✖ dplyr::select()  masks plotly::select(), raster::select()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
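The conflict messages above mean that, for a given name, the last-loaded package wins (e.g., dplyr::filter() now masks stats::filter()). A minimal base-R sketch of resolving the ambiguity with the `::` operator:

```r
# When several packages export the same name, prefix the call with the
# package name to make the intended function explicit.
x <- c(1, 4, 9, 16)

# stats::filter() applies a linear (moving-average) filter to a series,
# regardless of which packages mask the bare name `filter`
smoothed <- stats::filter(x, rep(1/2, 2), sides = 1)
smoothed

# dplyr::filter() subsets data-frame rows (assuming dplyr is loaded), e.g.:
# dplyr::filter(occ_quemin, countryCode == "US")
```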

Double-check your working directory.

Function getwd()

You can use the function getwd() to get the current working directory.
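A quick sketch of checking (and, if needed, changing) the working directory; the path in the comment is a placeholder, not a real location:

```r
# Print the current working directory; relative paths such as "data/OCC"
# used later in this lab resolve against this location
wd <- getwd()
print(wd)

# Change it if needed (placeholder path)
# setwd("~/path/to/your/lab/folder")
```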

2 Prepare data

2.1 Download species occurrences from GBIF

Get some occurrence data for our species from GBIF, directly within R. This may take some time, given the number of occurrences for the selected species.

NOTE: we need to have an internet connection.

Set the vector with some species scientific names (N = 5) to download occurrence data from GBIF.

Code
spp <- c("Quercus virginiana", "Quercus minima", "Quercus alba", "Quercus fusiformis", "Quercus oleoides")

Download data using the vector of species names

Code
gbif_data <- occ_search(scientificName = spp, 
                      hasCoordinate = TRUE, 
                      limit = 2000)  

#decrease the 'limit' if you just want to see how many records there are without waiting all the time that it will take to download the whole dataset.

Save the raw data

Code
dir.create("data")
dir.create("data/OCC")

# save the downloaded data in the working directory
save(gbif_data, file = "data/OCC/oaks_raw_occ.RDATA")

Inspect the downloaded data. Note that we downloaded data for five species, but for simplicity let’s pick one species for this lab. I will select Quercus minima, but you can select any other species.

The returned object is a list of five elements (one per species in our vector), so to select the focal species just use [[position of the species in the vector]]. For example, I selected Quercus minima, which is the second species in the vector, so to access it I just need to type gbif_data[[2]]. Let’s see:

Code
#gbif_data
# if, for any species, "Records found" is larger than "Records returned", you need to increase the 'limit' argument above -- see help(occ_search) for options and limitations

# check how the data are organized:
names(gbif_data)

# metadata of the object #2
names(gbif_data[[2]]$meta)

# the data
names(gbif_data[[2]]$data)
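As a toy illustration of the `[[ ]]` indexing (the names and counts below are made up, not real GBIF fields):

```r
# A list of per-species results; [[i]] pulls out the i-th species' element
gbif_toy <- list(
  virginiana = list(meta = list(count = 10), data = "records..."),
  minima     = list(meta = list(count = 20), data = "records...")
)

gbif_toy[[2]]$meta$count  # second element of the list, i.e. the second species
```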

Wow, there are a bunch of columns! Let’s select the ones that are relevant for our purposes.

Code
# get the columns that matter for mapping and cleaning the occurrence data:
occ_quemin <- gbif_data[[2]]$data %>% 
  dplyr::select(species, decimalLongitude, 
                decimalLatitude, countryCode, individualCount,
                gbifID, family, taxonRank, coordinateUncertaintyInMeters,
                year, basisOfRecord, institutionCode, datasetName)

occ_quemin
# A tibble: 808 × 13
   species   decimalLongitude decimalLatitude countryCode individualCount gbifID
   <chr>                <dbl>           <dbl> <chr>                 <int> <chr> 
 1 Quercus …            -81.0            28.1 US                       NA 45163…
 2 Quercus …            -81.1            27.9 US                       NA 45168…
 3 Quercus …            -81.1            27.9 US                       NA 45168…
 4 Quercus …            -82.3            27.2 US                       NA 45165…
 5 Quercus …            -80.2            27.0 US                       NA 45165…
 6 Quercus …            -82.0            27.1 US                       NA 45163…
 7 Quercus …            -81.0            27.6 US                       NA 45166…
 8 Quercus …            -81.0            27.6 US                       NA 45168…
 9 Quercus …            -81.3            26.8 US                       NA 45167…
10 Quercus …            -81.9            27.0 US                       NA 45164…
# ℹ 798 more rows
# ℹ 7 more variables: family <chr>, taxonRank <chr>,
#   coordinateUncertaintyInMeters <dbl>, year <int>, basisOfRecord <chr>,
#   institutionCode <chr>, datasetName <chr>

Let’s look at the data.

Code
glimpse(occ_quemin)
Rows: 808
Columns: 13
$ species                       <chr> "Quercus minima", "Quercus minima", "Que…
$ decimalLongitude              <dbl> -81.03141, -81.12212, -81.12194, -82.275…
$ decimalLatitude               <dbl> 28.10737, 27.90959, 27.90848, 27.23522, …
$ countryCode                   <chr> "US", "US", "US", "US", "US", "US", "US"…
$ individualCount               <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ gbifID                        <chr> "4516323040", "4516828390", "4516820683"…
$ family                        <chr> "Fagaceae", "Fagaceae", "Fagaceae", "Fag…
$ taxonRank                     <chr> "SPECIES", "SPECIES", "SPECIES", "SPECIE…
$ coordinateUncertaintyInMeters <dbl> 5113, 10, 6, NA, 4, 10, 3, 3, 3, 4, 4, 1…
$ year                          <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024…
$ basisOfRecord                 <chr> "HUMAN_OBSERVATION", "HUMAN_OBSERVATION"…
$ institutionCode               <chr> "iNaturalist", "iNaturalist", "iNaturali…
$ datasetName                   <chr> "iNaturalist research-grade observations…

Now let’s clean the data a little bit. First, remove all records with missing coordinates.

Code
occ_quemin <- occ_quemin %>% 
  filter(!is.na(decimalLongitude), !is.na(decimalLatitude))

occ_quemin
# A tibble: 808 × 13
   species   decimalLongitude decimalLatitude countryCode individualCount gbifID
   <chr>                <dbl>           <dbl> <chr>                 <int> <chr> 
 1 Quercus …            -81.0            28.1 US                       NA 45163…
 2 Quercus …            -81.1            27.9 US                       NA 45168…
 3 Quercus …            -81.1            27.9 US                       NA 45168…
 4 Quercus …            -82.3            27.2 US                       NA 45165…
 5 Quercus …            -80.2            27.0 US                       NA 45165…
 6 Quercus …            -82.0            27.1 US                       NA 45163…
 7 Quercus …            -81.0            27.6 US                       NA 45166…
 8 Quercus …            -81.0            27.6 US                       NA 45168…
 9 Quercus …            -81.3            26.8 US                       NA 45167…
10 Quercus …            -81.9            27.0 US                       NA 45164…
# ℹ 798 more rows
# ℹ 7 more variables: family <chr>, taxonRank <chr>,
#   coordinateUncertaintyInMeters <dbl>, year <int>, basisOfRecord <chr>,
#   institutionCode <chr>, datasetName <chr>

Cleaning geographic coordinates by flagging potentially erroneous locations.

Code
#convert country code from ISO2c to ISO3c
occ_quemin$countryCode <- countrycode::countrycode(occ_quemin$countryCode, 
                                                   origin = 'iso2c', 
                                                   destination = 'iso3c')

#flag problems
occ_quemin <- data.frame(occ_quemin)

occ_quemin_flag <- clean_coordinates(x = occ_quemin, 
                           lon = "decimalLongitude", 
                           lat = "decimalLatitude",
                           countries = "countryCode",
                           species = "species",
                           tests = c("capitals", "centroids",
                                    "equal", "zeros", "countries")) # most test are on by default
Testing coordinate validity
Flagged 0 records.
Testing equal lat/lon
Flagged 0 records.
Testing zero coordinates
Flagged 0 records.
Testing country capitals
Flagged 0 records.
Testing country centroids
Flagged 0 records.
Testing country identity
Flagged 23 records.
Flagged 23 of 808 records, EQ = 0.03.

See the results

Code
summary(occ_quemin_flag)
    .val     .equ     .zer     .cap     .cen     .con .summary 
       0        0        0        0        0       23       23 

It seems that 23 observations were flagged; let’s plot the points to see which ones.

Code
plot(occ_quemin_flag, lon = "decimalLongitude", lat = "decimalLatitude")

Exclude or filter the flagged records

Code
occ_quemin_clean <- occ_quemin_flag %>% 
  filter(.summary == TRUE)

#occ_quemin_clean

Plot the cleaned data

Code
plot(occ_quemin_clean, lon = "decimalLongitude", lat = "decimalLatitude")

Now transform our occurrence data to a spatial object; this will help us visualize the data better.

Code
quemin_spt <- st_as_sf(occ_quemin_clean, 
                       coords = c("decimalLongitude", "decimalLatitude"), 
                       crs = 4326) #%>%
  #st_cast("MULTIPOINT")

# explore the data
glimpse(quemin_spt)
Rows: 785
Columns: 19
$ species                       <chr> "Quercus minima", "Quercus minima", "Que…
$ countryCode                   <chr> "USA", "USA", "USA", "USA", "USA", "USA"…
$ individualCount               <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ gbifID                        <chr> "4516323040", "4516828390", "4516820683"…
$ family                        <chr> "Fagaceae", "Fagaceae", "Fagaceae", "Fag…
$ taxonRank                     <chr> "SPECIES", "SPECIES", "SPECIES", "SPECIE…
$ coordinateUncertaintyInMeters <dbl> 5113, 10, 6, NA, 4, 10, 3, 3, 3, 4, 4, 1…
$ year                          <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024…
$ basisOfRecord                 <chr> "HUMAN_OBSERVATION", "HUMAN_OBSERVATION"…
$ institutionCode               <chr> "iNaturalist", "iNaturalist", "iNaturali…
$ datasetName                   <chr> "iNaturalist research-grade observations…
$ .val                          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ .equ                          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ .zer                          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ .cap                          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ .cen                          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ .con                          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ .summary                      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ geometry                      <POINT [°]> POINT (-81.03141 28.10737), POINT …

Let’s plot the distribution of Quercus minima occurrences.

Code
sf_use_s2(FALSE)
Spherical geometry (s2) switched off
Code
# world map
worldMap <- rnaturalearth::ne_countries(scale = "medium", 
                                        type = "countries", 
                                        returnclass = "sf")

# country subset
USpoly <- worldMap %>% 
  #filter(region_wb == "North America")
  filter(admin == "United States of America")

Plot the occurrence records

Code
map_US <- map_data('world')[map_data('world')$region == "USA", ]

ggplot() + 
  geom_polygon(data = map_data("world"), 
               aes(x = long, y = lat, group = group), 
               color = "#f1f2f3", fill = '#f3f3f3') + 
  geom_polygon(data = map_US, 
               aes(x = long, y = lat, group = group), 
               color = 'lightgray', fill = 'lightgray', alpha = 0.5) + 
  geom_point(data = occ_quemin, 
             aes(x = decimalLongitude, y = decimalLatitude), 
             color = "darkgray", alpha = 0.9) + 
  coord_map() + 
  coord_fixed(1.3, 
              xlim = c(-65, -125), 
              ylim = c(50, 25)
              ) + 
  theme(panel.background = element_rect(fill = "lightblue"))
Coordinate system already present. Adding new coordinate system, which will
replace the existing one.

Small joke 😬🫣.

Now the real map…

Code
ggplot() + 
  geom_polygon(data = map_data("world"), 
               aes(x = long, y = lat, group = group), 
               color = "#f1f2f3", fill = '#f3f3f3', alpha = 0.5) + 
  geom_sf(data = USpoly) + 
  geom_sf(data = quemin_spt, color = "darkgray", alpha = 0.9) + 
  coord_sf(
     xlim = c(-125, -65), 
     ylim = c(25, 50)
  ) + 
  theme(
    plot.background = element_rect(fill = "#f1f2f3"),
    panel.background = element_rect(fill = "lightblue"),
    panel.grid = element_blank(),
    line = element_blank(),
    rect = element_blank()
  )

The data seem correct, but spatial aggregation of records can cause trouble in model predictions. To address this issue we will apply spatial thinning to the species occurrences using the package {spThin}. Given that our species has a broad distribution, let’s set a distance threshold of 2 km between occurrence points.

Code
quemin_thinned <- thin(
      loc.data =  occ_quemin_clean, 
      verbose = FALSE, 
      long.col = "decimalLongitude", 
      lat.col = "decimalLatitude",
      spec.col = "species",
      thin.par = 2, # points have at least a minimum distance of 2 km from each other
      reps = 10, # number of repetitions
      locs.thinned.list.return = TRUE, 
      write.files = FALSE, 
      out.dir = "data/OCC/")
    
quemin_thinned <- as.data.frame(quemin_thinned[[10]])
quemin_thinned$species <- "Quercus_minima"

Explore the results of the thinning process. We can see that the number of occurrences of Quercus minima decreased from 785 to 440. We will use the thinned data for further analyses.

Code
glimpse(quemin_thinned)
Rows: 440
Columns: 3
$ Longitude <dbl> -81.12212, -82.27573, -81.29731, -81.88913, -82.58465, -80.4…
$ Latitude  <dbl> 27.90959, 27.23522, 26.81852, 27.00880, 28.31629, 25.65069, …
$ species   <chr> "Quercus_minima", "Quercus_minima", "Quercus_minima", "Querc…
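Note that thin() returns one thinned data set per repetition (here reps = 10), and the code above simply keeps the last one. A hedged sketch, with a toy list standing in for the real thin() output, of keeping the repetition that retains the most records instead:

```r
# Toy stand-in for the list returned by spThin::thin(); each element is one
# thinned repetition with a different number of surviving records
set.seed(1)
thinned_reps <- list(
  data.frame(Longitude = runif(430), Latitude = runif(430)),
  data.frame(Longitude = runif(440), Latitude = runif(440))
)

best <- which.max(sapply(thinned_reps, nrow)) # repetition with most records
quemin_best <- thinned_reps[[best]]
nrow(quemin_best)
```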

Transform the thinned occurrences to a spatial object

Code
quemin_thinned_sf <- st_as_sf(quemin_thinned, 
                       coords = c("Longitude", "Latitude"), 
                       crs = 4326) 

# The code below uses base R. We will use this object for the modelling part
quemin_thinned_spt <- SpatialPointsDataFrame(coords = quemin_thinned[, 1:2], 
                                             data = quemin_thinned, 
                                             proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")) 

Now visualize both spatial objects.

Code
ggplot() + 
  geom_polygon(data = map_data("world"), 
               aes(x = long, y = lat, group = group), 
               color = "#f1f2f3", fill = '#f3f3f3') + 
  geom_sf(data = USpoly) + 
  geom_sf(data = quemin_spt, color = "blue", alpha = 0.5) + 
  geom_sf(data = quemin_thinned_sf, color = "red", alpha = 0.3) + 
  coord_sf(
     xlim = c(-125, -65), 
     ylim = c(25, 50)
  ) +
  theme(
    plot.background = element_rect(fill = "#f1f2f3"),
    panel.background = element_rect(fill = "lightblue"),
    panel.grid = element_blank(),
    line = element_blank(),
    rect = element_blank()
  )

Now let’s save the processed data and keep quemin_thinned_spt for further analyses.

Code
save(occ_quemin, quemin_spt, quemin_thinned_sf, quemin_thinned, 
     file = "data/OCC/quemin_OCC_processed.RData")

rm(gbif_data, occ_quemin, quemin_spt)

Up to here we have downloaded and cleaned species occurrence data; now we will prepare the environmental data that will be used as predictors for constructing the ENM/SDM for our species.

2.2 Prepare environmental data

First we will obtain bioclimatic variables from WorldClim using the function worldclim_global() from the package {geodata}; then we will select which variables are most relevant for explaining the distribution of our species.

Code
bios <- geodata::worldclim_global(var = "bio", # climatic variables
                         res = 10, # spatial resolution
                         path = "data/", # folder location
                         version = "2.1")

You can see that, by using the argument path within the function worldclim_global(), the bioclimatic variables were also saved directly to your hard drive, so you don’t need to download the data again.

Download the climatic data manually

If you have issues downloading WorldClim data, you can download the data HERE and then store the climatic data within the folder data.

2.2.1 Load bioclimatic data stored in your computer

If you downloaded the bioclimatic data from dropbox please use the next lines of code.

Code
biosNames <- list.files("data/wc2.1_10m/", 
                        pattern = "tif$")

# Please double check your folder path. 

bios <- rast(paste0("data/wc2.1_10m/", # folder directory 
                     biosNames)) # file names

Let’s explore some details of the bioclimatic data

Code
names(bios)
 [1] "wc2.1_10m_bio_1"  "wc2.1_10m_bio_10" "wc2.1_10m_bio_11" "wc2.1_10m_bio_12"
 [5] "wc2.1_10m_bio_13" "wc2.1_10m_bio_14" "wc2.1_10m_bio_15" "wc2.1_10m_bio_16"
 [9] "wc2.1_10m_bio_17" "wc2.1_10m_bio_18" "wc2.1_10m_bio_19" "wc2.1_10m_bio_2" 
[13] "wc2.1_10m_bio_3"  "wc2.1_10m_bio_4"  "wc2.1_10m_bio_5"  "wc2.1_10m_bio_6" 
[17] "wc2.1_10m_bio_7"  "wc2.1_10m_bio_8"  "wc2.1_10m_bio_9" 
Code
str(bios[[1]])
S4 class 'SpatRaster' [package "terra"]

Now let’s plot some bioclimatic variables, say BIO1 and BIO12, which are mean annual temperature and annual precipitation, respectively.

Code
plot(c(bios[[1]], bios[[12]]))

Okay, the environmental data is at the global level but our species only occurs in North America. To get a better visualization of the data, let’s crop the bioclimatic data to the extent of the United States.

Code
USpoly_spt <- as(USpoly, "Spatial") # used as extent

plot(USpoly_spt)

Now crop the bioclimatic data to the extent of the United States.

Code
bios_US <- crop(bios, USpoly, mask = TRUE)
Warning: [crop] CRS do not match

Let’s visualize the results of cropping the bioclimatic data and overlap the thinned occurrences of Quercus minima.

Code
plot(bios_US[[1]]) # mean annual temperature
plot(quemin_thinned_spt, col = "red", pch = 16, add = TRUE) # add occurrence records
plot(USpoly_spt, lwd = 2, lty = 2, add = TRUE) # add country borders

Cool!

3 Accessible area

Before any further analysis we need to define the accessible area for our species (Quercus minima) in geographical space; remember, our species is distributed in the United States. To define the accessible area let’s use a bounding box around the known species occurrences plus ~300 km beyond that bound. This gives us an approximation of the species’ dispersal within the geographical domain, i.e., North America. This bounding box represents the component M in the BAM diagram.

Code
### Species specific accessible area
bb <- bbox(quemin_thinned_spt) # bounding box

ex <- ext(c(bb[1]-3, bb[3]+3, bb[2]-3, bb[4]+3)) # bounding box padded by 3 degrees (~300 km)

pex <- as.polygons(ex, crs = crs(bios_US)) #as(ex, 'SpatialPolygons') # transform to polygon
#crs(pex) <- crs(bios_US) # use the geographical reference of the bioclimatic variables

crs(USpoly_spt) <- crs(bios_US)

out <- terra::crop(USpoly_spt, as(pex, 'Spatial'), byid = FALSE) # use NAs to eliminate areas on the sea
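The 3-degree pad used above approximates the intended ~300 km buffer because one degree of latitude spans roughly 111 km (a rough rule of thumb; degrees of longitude shrink toward the poles):

```r
# Rough degree-to-km conversion behind the 3-degree pad
km_per_degree <- 111            # approximate km per degree of latitude
pad_degrees <- 3
pad_km <- pad_degrees * km_per_degree
pad_km                          # roughly the intended ~300 km buffer
```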

Now let’s plot the accessible area.

Code
plot(bios_US[[1]])
plot(pex, add = TRUE, lty = 2)
plot(out, add = TRUE, lwd = 2)

Nice, we can see that the accessible area for our species is the southeast United States.

Now crop the bioclimatic data to the extent of the accessible area.

Code
bios_spp <- terra::crop(bios, out, mask = TRUE)
Warning: [crop] mask=TRUE is ignored for SpatExtent objects
Code
#bios_spp <- terra::mask(bios_spp, vect(out))

# plot the results
plot(bios_spp[[1]])
#plot(quemin_thinned_spt, add = TRUE, col = "red", pch = 16)
plot(USpoly_spt, add = TRUE, lty = 2)

Note that we will use this accessible area as extent for all the posterior analyses.

4 Pseudoabsences

To model the environmental niches (ghosts) and to project the species distributions (shadows) we need data on species presences and absences; however, we only have presence data. What should we do if we don’t have absences? That’s the question! There is no straightforward answer, but the most common procedure to get absence data is to generate random points (AKA pseudoabsences) within the geographical domain (i.e., the accessible area or M). There is a lot of discussion on how many pseudoabsences to use for ENM/SDM; here we will use a conservative number, i.e., twice the number of presences.

Code
set.seed(12345) # Random Number Generation to obtain the same result

# Generate the data
absence <- dismo::randomPoints(mask = as(bios_spp[[1]], "Raster"), # transform spatraster to raster
                        n = round(nrow(quemin_thinned)*2, 0), # number of pseudoabsences
                        p = quemin_thinned_spt, # presence object
                        ext = c(bb[1]-3, bb[3]+3, bb[2]-3, bb[4]+3) # extent
                        )

Now let’s combine the presence and pseudoabsence data

Code
# Presences
presence <- data.frame(quemin_thinned) # presence data

# Pseudoabsences
absence <- data.frame(absence) # pseudoabsence data
absence$species <- "Quercus_minima"
names(absence) <- names(presence)

presence$Occurrence <- 1 # presence data
absence$Occurrence <- 0 # pseudoabsence data

quemin <- rbind(presence, absence) # combine both information

Finally transform the presence-pseudoabsence data to a spatial object and visualize it!

Code
quemin_sp <- quemin

coordinates(quemin_sp) <- ~ Longitude + Latitude
crs(quemin_sp) <- crs(bios_spp)

quemin_sp
class       : SpatialPointsDataFrame 
features    : 1320 
extent      : -100.25, -74.25, 24.58333, 39.75  (xmin, xmax, ymin, ymax)
crs         : +proj=longlat +datum=WGS84 +no_defs 
variables   : 2
names       :        species, Occurrence 
min values  : Quercus_minima,          0 
max values  : Quercus_minima,          1 
Code
quemin_sf <- st_as_sf(quemin, 
                       coords = c("Longitude", "Latitude"), 
                       crs = crs(bios_spp)) 

Let’s visualize the results

Code
plot(bios_spp[[1]])
plot(quemin_sp[quemin_sp$Occurrence == 1, ], col = "red", add = TRUE, pch = 16) # presence
points(quemin_sp[quemin_sp$Occurrence == 0, ], col = "blue", pch = 16) # absence

We can save the processed data and clean the environment a little bit.

Code
save(presence, absence, quemin, file = "data/OCC/quemin_PresAbs.RData")

save(bb, ex, USpoly_spt, out, pex, file = "data/accessible_area_quemin.RData")

rm(absence, presence, bios, ex, out, pex, bb)

Now, we need to decide which variables are necessary to model the niche (Grinnellian niche, fundamental niche, or the ghost) for our oak species. Here we can base the selection of variables on expert knowledge, on statistical tools, or both.

5 Variable selection

Let’s use the statistical way.

Code
## extract variables
quemin_bios <- data.frame(terra::extract(bios_spp, quemin_sf))

## Combine coordinates and climate data
quemin_bios <- cbind(data.frame(quemin), quemin_bios)

## Remove NAs
quemin_bios <- quemin_bios[complete.cases(quemin_bios), ]
quemin_bios <- na.omit(quemin_bios)

glimpse(quemin_bios)
Rows: 1,320
Columns: 24
$ Longitude        <dbl> -81.12212, -82.27573, -81.29731, -81.88913, -82.58465…
$ Latitude         <dbl> 27.90959, 27.23522, 26.81852, 27.00880, 28.31629, 25.…
$ species          <chr> "Quercus_minima", "Quercus_minima", "Quercus_minima",…
$ Occurrence       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ ID               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ wc2.1_10m_bio_1  <dbl> 22.29482, 22.62983, 22.98208, 22.70179, 21.99941, 24.…
$ wc2.1_10m_bio_10 <dbl> 27.21588, 27.56979, 27.47992, 27.51846, 27.40946, 27.…
$ wc2.1_10m_bio_11 <dbl> 16.67392, 17.02667, 17.91229, 17.24879, 15.77767, 20.…
$ wc2.1_10m_bio_12 <dbl> 1217, 1345, 1251, 1274, 1325, 1470, 1287, 1275, 1324,…
$ wc2.1_10m_bio_13 <dbl> 181, 229, 218, 208, 209, 247, 176, 192, 188, 207, 206…
$ wc2.1_10m_bio_14 <dbl> 43, 40, 40, 41, 52, 37, 58, 52, 62, 87, 41, 46, 44, 4…
$ wc2.1_10m_bio_15 <dbl> 53.25320, 63.83318, 61.05693, 62.13505, 48.60321, 60.…
$ wc2.1_10m_bio_16 <dbl> 526, 641, 585, 600, 560, 617, 511, 539, 543, 565, 570…
$ wc2.1_10m_bio_17 <dbl> 158, 165, 140, 149, 201, 142, 195, 177, 198, 288, 140…
$ wc2.1_10m_bio_18 <dbl> 506, 629, 523, 566, 551, 581, 511, 538, 543, 552, 525…
$ wc2.1_10m_bio_19 <dbl> 171, 187, 146, 166, 238, 142, 209, 191, 220, 350, 156…
$ wc2.1_10m_bio_2  <dbl> 12.043104, 12.271375, 12.762083, 12.540167, 11.448771…
$ wc2.1_10m_bio_3  <dbl> 51.20584, 52.59687, 55.31416, 53.86959, 48.26581, 52.…
$ wc2.1_10m_bio_4  <dbl> 434.3550, 436.0064, 395.7274, 424.9552, 479.4438, 321…
$ wc2.1_10m_bio_5  <dbl> 32.96900, 33.28675, 33.54925, 33.33500, 32.69825, 32.…
$ wc2.1_10m_bio_6  <dbl> 9.45000, 9.95575, 10.47725, 10.05625, 8.97800, 14.303…
$ wc2.1_10m_bio_7  <dbl> 23.51900, 23.33100, 23.07200, 23.27875, 23.72025, 18.…
$ wc2.1_10m_bio_8  <dbl> 27.17908, 27.49458, 27.33037, 27.42100, 27.35733, 27.…
$ wc2.1_10m_bio_9  <dbl> 17.83258, 18.23458, 19.23537, 18.49721, 19.76171, 20.…

One way to select variables is to explore the correlation among the predictors. We can use a threshold of |r| < 0.7, which means that variables correlated below this threshold can be considered non-problematic (Dormann et al., 2013); however, you can use a more conservative threshold, such as < 0.5.

To do that, we first estimate a correlation matrix from the predictors. We use Spearman rank correlation coefficient, as we do not know whether all variables are normally distributed.

Code
cor_mat <- cor(quemin_bios[, c(6:24)], method = 'spearman')

Let’s visualize the correlation matrix.

Code
corrplot::corrplot.mixed(cor_mat, 
                         tl.pos = "lt", 
                         tl.cex = 0.5, 
                         number.cex = 0.5, 
                         addCoefasPercent = TRUE, 
                         mar = c(0, 0, 1, 0))
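Rather than reading the problematic pairs off the plot, you can also list them by scanning the upper triangle of the matrix. A sketch using a small toy correlation matrix (in the lab you would pass the real cor_mat from above):

```r
# Toy data: make bio_10 track bio_1 so that one pair exceeds the threshold
set.seed(1)
m <- matrix(rnorm(300), ncol = 3,
            dimnames = list(NULL, c("bio_1", "bio_10", "bio_12")))
m[, 2] <- m[, 1] + rnorm(100, sd = 0.1)
cor_toy <- cor(m, method = "spearman")

# Pairs above |r| = 0.7; upper triangle only, to avoid listing each pair twice
idx <- which(abs(cor_toy) > 0.7 & upper.tri(cor_toy), arr.ind = TRUE)
data.frame(var1 = rownames(cor_toy)[idx[, 1]],
           var2 = colnames(cor_toy)[idx[, 2]],
           r    = round(cor_toy[idx], 2))
```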

It looks like several variables are highly correlated. Let’s identify the ones that are highly correlated with each other and exclude them from further analysis.

Inspecting predictor by predictor can take forever. What should we do? Well, we have two options:

  1. Talk with the specialist about which predictors should we use, or

  2. Let the computer decide

The simple solution is letting the computer decide. To do that, we will use a handy function called select07() from the package {mecofun} that selects the predictors below the correlation threshold.

I extracted the function select07() from the package {mecofun} so we can use it directly without installing {mecofun}.

Code
set.seed(12345)

#source("R-Functions/select07_mod.R")
source("https://raw.githubusercontent.com/jesusNPL/BiodiversityScience/master/Spring2024/Lab_5_ENM_SDM/functions/select07_mod.R")

# Run select07()
covar_sel <- select07_v2(X = quemin_bios[, -c(1:5)], # only predictors data 
                         y = quemin_bios$Occurrence, # presence-absence data 
                         threshold = 0.7) # you can change the correlation threshold here
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Note that if you want to install {mecofun}, you can use the code below. Important: a message will appear in your console asking whether you would like to update existing packages; type 3 and press Enter. This tells R to install just {mecofun} and avoids errors during the installation.

Code
# Install the package from source
remotes::install_git("https://gitup.uni-potsdam.de/macroecology/mecofun.git")

The function will return a list of three objects:

  1. AIC values for each model

  2. The correlation matrix

  3. A vector of the selected variables.

Code
# Check out the structure of the resulting object:
str(covar_sel)
List of 3
 $ AIC     : Named num [1:19] 465 507 529 608 677 ...
  ..- attr(*, "names")= chr [1:19] "wc2.1_10m_bio_18" "wc2.1_10m_bio_16" "wc2.1_10m_bio_13" "wc2.1_10m_bio_7" ...
 $ cor_mat : num [1:19, 1:19] 1 0.9 0.988 0.271 0.668 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:19] "wc2.1_10m_bio_1" "wc2.1_10m_bio_10" "wc2.1_10m_bio_11" "wc2.1_10m_bio_12" ...
  .. ..$ : chr [1:19] "wc2.1_10m_bio_1" "wc2.1_10m_bio_10" "wc2.1_10m_bio_11" "wc2.1_10m_bio_12" ...
 $ pred_sel: chr [1:7] "wc2.1_10m_bio_18" "wc2.1_10m_bio_4" "wc2.1_10m_bio_9" "wc2.1_10m_bio_5" ...

Let’s see the results…

Code
covar_sel$AIC
wc2.1_10m_bio_18 wc2.1_10m_bio_16 wc2.1_10m_bio_13  wc2.1_10m_bio_7 
        464.6303         507.2489         529.1343         607.7619 
 wc2.1_10m_bio_4  wc2.1_10m_bio_6  wc2.1_10m_bio_8 wc2.1_10m_bio_11 
        676.7551         759.2440         774.1234         799.5277 
 wc2.1_10m_bio_1  wc2.1_10m_bio_3  wc2.1_10m_bio_9 wc2.1_10m_bio_10 
        860.1355         888.6783         974.2085        1001.8719 
 wc2.1_10m_bio_5 wc2.1_10m_bio_15 wc2.1_10m_bio_12  wc2.1_10m_bio_2 
       1126.0436        1242.5124        1307.1922        1435.6041 
wc2.1_10m_bio_19 wc2.1_10m_bio_14 wc2.1_10m_bio_17 
       1555.7457        1562.0175        1571.7289 
Code
covar_sel$cor_mat
                 wc2.1_10m_bio_1 wc2.1_10m_bio_10 wc2.1_10m_bio_11
wc2.1_10m_bio_1        1.0000000       0.89958719        0.9881272
wc2.1_10m_bio_10       0.8995872       1.00000000        0.8432467
wc2.1_10m_bio_11       0.9881272       0.84324672        1.0000000
wc2.1_10m_bio_12       0.2714835       0.03982751        0.3443746
wc2.1_10m_bio_13       0.6675724       0.42365968        0.7236448
wc2.1_10m_bio_14      -0.3146627      -0.45039143       -0.2406941
wc2.1_10m_bio_15       0.7119316       0.70384566        0.6717250
wc2.1_10m_bio_16       0.6024811       0.32690785        0.6714260
wc2.1_10m_bio_17      -0.2991749      -0.43223588       -0.2276651
wc2.1_10m_bio_18       0.5235346       0.21999222        0.6046156
wc2.1_10m_bio_19      -0.1992819      -0.34988136       -0.1288383
wc2.1_10m_bio_2       -0.2929485      -0.19159192       -0.3145510
wc2.1_10m_bio_3        0.8531918       0.64442047        0.8936277
wc2.1_10m_bio_4       -0.9307230      -0.71895097       -0.9713914
wc2.1_10m_bio_5        0.4500501       0.66077433        0.3675603
wc2.1_10m_bio_6        0.9775318       0.82178575        0.9941740
wc2.1_10m_bio_7       -0.8928538      -0.67111908       -0.9400411
wc2.1_10m_bio_8        0.8106921       0.70325599        0.8088511
wc2.1_10m_bio_9        0.5185878       0.38716314        0.5453769
                 wc2.1_10m_bio_12 wc2.1_10m_bio_13 wc2.1_10m_bio_14
wc2.1_10m_bio_1        0.27148355       0.66757239      -0.31466268
wc2.1_10m_bio_10       0.03982751       0.42365968      -0.45039143
wc2.1_10m_bio_11       0.34437460       0.72364482      -0.24069412
wc2.1_10m_bio_12       1.00000000       0.74925052       0.67635869
wc2.1_10m_bio_13       0.74925052       1.00000000       0.13719884
wc2.1_10m_bio_14       0.67635869       0.13719884       1.00000000
wc2.1_10m_bio_15      -0.15672747       0.43890333      -0.78098048
wc2.1_10m_bio_16       0.77576631       0.97097985       0.20130467
wc2.1_10m_bio_17       0.70592274       0.16490873       0.97819976
wc2.1_10m_bio_18       0.70194692       0.88843488       0.20356520
wc2.1_10m_bio_19       0.76391906       0.24564578       0.92518226
wc2.1_10m_bio_2       -0.34176051      -0.44155866      -0.15512179
wc2.1_10m_bio_3        0.35423655       0.67731666      -0.18385018
wc2.1_10m_bio_4       -0.42695119      -0.76307124       0.12287208
wc2.1_10m_bio_5       -0.28345856      -0.02117586      -0.54613081
wc2.1_10m_bio_6        0.38452231       0.74159250      -0.18333577
wc2.1_10m_bio_7       -0.49380050      -0.79056108       0.03614669
wc2.1_10m_bio_8        0.12291958       0.57898817      -0.37838284
wc2.1_10m_bio_9        0.56960233       0.52612490       0.23423021
                 wc2.1_10m_bio_15 wc2.1_10m_bio_16 wc2.1_10m_bio_17
wc2.1_10m_bio_1        0.71193156        0.6024811      -0.29917495
wc2.1_10m_bio_10       0.70384566        0.3269079      -0.43223588
wc2.1_10m_bio_11       0.67172498        0.6714260      -0.22766510
wc2.1_10m_bio_12      -0.15672747        0.7757663       0.70592274
wc2.1_10m_bio_13       0.43890333        0.9709799       0.16490873
wc2.1_10m_bio_14      -0.78098048        0.2013047       0.97819976
wc2.1_10m_bio_15       1.00000000        0.3666027      -0.76856713
wc2.1_10m_bio_16       0.36660271        1.0000000       0.22366008
wc2.1_10m_bio_17      -0.76856713        0.2236601       1.00000000
wc2.1_10m_bio_18       0.31514828        0.9368405       0.21308111
wc2.1_10m_bio_19      -0.67696649        0.3016099       0.94683693
wc2.1_10m_bio_2       -0.08735412       -0.4467964      -0.15940192
wc2.1_10m_bio_3        0.59117464        0.6616263      -0.18081620
wc2.1_10m_bio_4       -0.58237286       -0.7278881       0.11748199
wc2.1_10m_bio_5        0.52839518       -0.1024151      -0.52367328
wc2.1_10m_bio_6        0.62912799        0.6928807      -0.16950689
wc2.1_10m_bio_7       -0.51662752       -0.7627136       0.02797589
wc2.1_10m_bio_8        0.71633177        0.5455706      -0.38115767
wc2.1_10m_bio_9        0.09967972        0.5298675       0.26391191
                 wc2.1_10m_bio_18 wc2.1_10m_bio_19 wc2.1_10m_bio_2
wc2.1_10m_bio_1         0.5235346      -0.19928185     -0.29294853
wc2.1_10m_bio_10        0.2199922      -0.34988136     -0.19159192
wc2.1_10m_bio_11        0.6046156      -0.12883827     -0.31455098
wc2.1_10m_bio_12        0.7019469       0.76391906     -0.34176051
wc2.1_10m_bio_13        0.8884349       0.24564578     -0.44155866
wc2.1_10m_bio_14        0.2035652       0.92518226     -0.15512179
wc2.1_10m_bio_15        0.3151483      -0.67696649     -0.08735412
wc2.1_10m_bio_16        0.9368405       0.30160986     -0.44679637
wc2.1_10m_bio_17        0.2130811       0.94683693     -0.15940192
wc2.1_10m_bio_18        1.0000000       0.26723053     -0.45558339
wc2.1_10m_bio_19        0.2672305       1.00000000     -0.08663581
wc2.1_10m_bio_2        -0.4555834      -0.08663581      1.00000000
wc2.1_10m_bio_3         0.6406129      -0.05751656     -0.03070343
wc2.1_10m_bio_4        -0.6972318       0.02880025      0.35605876
wc2.1_10m_bio_5        -0.2392596      -0.39448515      0.43328567
wc2.1_10m_bio_6         0.6313973      -0.07761940     -0.38781719
wc2.1_10m_bio_7        -0.7421538      -0.04873234      0.49445338
wc2.1_10m_bio_8         0.6002015      -0.36677648     -0.38415718
wc2.1_10m_bio_9         0.3723274       0.40341939     -0.17056310
                 wc2.1_10m_bio_3 wc2.1_10m_bio_4 wc2.1_10m_bio_5
wc2.1_10m_bio_1       0.85319183     -0.93072303      0.45005006
wc2.1_10m_bio_10      0.64442047     -0.71895097      0.66077433
wc2.1_10m_bio_11      0.89362768     -0.97139143      0.36756027
wc2.1_10m_bio_12      0.35423655     -0.42695119     -0.28345856
wc2.1_10m_bio_13      0.67731666     -0.76307124     -0.02117586
wc2.1_10m_bio_14     -0.18385018      0.12287208     -0.54613081
wc2.1_10m_bio_15      0.59117464     -0.58237286      0.52839518
wc2.1_10m_bio_16      0.66162629     -0.72788813     -0.10241507
wc2.1_10m_bio_17     -0.18081620      0.11748199     -0.52367328
wc2.1_10m_bio_18      0.64061294     -0.69723178     -0.23925963
wc2.1_10m_bio_19     -0.05751656      0.02880025     -0.39448515
wc2.1_10m_bio_2      -0.03070343      0.35605876      0.43328567
wc2.1_10m_bio_3       1.00000000     -0.91458803      0.33827323
wc2.1_10m_bio_4      -0.91458803      1.00000000     -0.19809501
wc2.1_10m_bio_5       0.33827323     -0.19809501      1.00000000
wc2.1_10m_bio_6       0.87028274     -0.97620056      0.30235776
wc2.1_10m_bio_7      -0.84556739      0.98172103     -0.08365390
wc2.1_10m_bio_8       0.68595087     -0.78673903      0.23025759
wc2.1_10m_bio_9       0.51090011     -0.53947380      0.14160680
                 wc2.1_10m_bio_6 wc2.1_10m_bio_7 wc2.1_10m_bio_8
wc2.1_10m_bio_1        0.9775318     -0.89285383       0.8106921
wc2.1_10m_bio_10       0.8217858     -0.67111908       0.7032560
wc2.1_10m_bio_11       0.9941740     -0.94004108       0.8088511
wc2.1_10m_bio_12       0.3845223     -0.49380050       0.1229196
wc2.1_10m_bio_13       0.7415925     -0.79056108       0.5789882
wc2.1_10m_bio_14      -0.1833358      0.03614669      -0.3783828
wc2.1_10m_bio_15       0.6291280     -0.51662752       0.7163318
wc2.1_10m_bio_16       0.6928807     -0.76271357       0.5455706
wc2.1_10m_bio_17      -0.1695069      0.02797589      -0.3811577
wc2.1_10m_bio_18       0.6313973     -0.74215380       0.6002015
wc2.1_10m_bio_19      -0.0776194     -0.04873234      -0.3667765
wc2.1_10m_bio_2       -0.3878172      0.49445338      -0.3841572
wc2.1_10m_bio_3        0.8702827     -0.84556739       0.6859509
wc2.1_10m_bio_4       -0.9762006      0.98172103      -0.7867390
wc2.1_10m_bio_5        0.3023578     -0.08365390       0.2302576
wc2.1_10m_bio_6        1.0000000     -0.96290997       0.8066164
wc2.1_10m_bio_7       -0.9629100      1.00000000      -0.7751041
wc2.1_10m_bio_8        0.8066164     -0.77510413       1.0000000
wc2.1_10m_bio_9        0.5644109     -0.55273422       0.1524807
                 wc2.1_10m_bio_9
wc2.1_10m_bio_1       0.51858779
wc2.1_10m_bio_10      0.38716314
wc2.1_10m_bio_11      0.54537686
wc2.1_10m_bio_12      0.56960233
wc2.1_10m_bio_13      0.52612490
wc2.1_10m_bio_14      0.23423021
wc2.1_10m_bio_15      0.09967972
wc2.1_10m_bio_16      0.52986750
wc2.1_10m_bio_17      0.26391191
wc2.1_10m_bio_18      0.37232743
wc2.1_10m_bio_19      0.40341939
wc2.1_10m_bio_2      -0.17056310
wc2.1_10m_bio_3       0.51090011
wc2.1_10m_bio_4      -0.53947380
wc2.1_10m_bio_5       0.14160680
wc2.1_10m_bio_6       0.56441094
wc2.1_10m_bio_7      -0.55273422
wc2.1_10m_bio_8       0.15248070
wc2.1_10m_bio_9       1.00000000
Code
covar_sel$pred_sel
[1] "wc2.1_10m_bio_18" "wc2.1_10m_bio_4"  "wc2.1_10m_bio_9"  "wc2.1_10m_bio_5" 
[5] "wc2.1_10m_bio_15" "wc2.1_10m_bio_2"  "wc2.1_10m_bio_19"

According to the select07() function, seven climatic variables (predictors) best fit the data and also have low correlation. Assuming this is correct, we will use these variables in further analyses. The selected variables are: "bio18", "bio4", "bio9", "bio5", "bio15", "bio2", and "bio19". Let's store these variable names as a vector.

Code
preds <- covar_sel$pred_sel
preds
[1] "wc2.1_10m_bio_18" "wc2.1_10m_bio_4"  "wc2.1_10m_bio_9"  "wc2.1_10m_bio_5" 
[5] "wc2.1_10m_bio_15" "wc2.1_10m_bio_2"  "wc2.1_10m_bio_19"

Finally, let’s select the bioclimatic variables and plot them. Note that we will use these environmental layers for model predictions.

Code
bios_quemin_sel <- bios_spp[[preds]]

plot(bios_quemin_sel)

As a final preparation step, we will scale the predictors.

Code
quemin_bios_sel <- quemin_bios %>% 
  select(Occurrence, all_of(preds))

quemin_bios_sel[, 2:8] <- scale(quemin_bios_sel[, 2:8])
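Note that scale() centers each column to mean 0 and divides by its standard deviation; it also stores the centering and scaling values as attributes, which is handy if you later need to apply the same transformation to new data. A minimal illustration:

```r
# scale() standardizes column-wise and records the parameters it used
x <- matrix(c(10, 20, 30, 1, 2, 3), ncol = 2)
xs <- scale(x)

colMeans(xs)              # ~0 for every column
attr(xs, "scaled:center") # column means used: 20, 2
attr(xs, "scaled:scale")  # column standard deviations used: 10, 1
```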

Okay, we are ready to build the models for our species.

6 Building Environmental Niche Models

To model the environmental niche of Quercus minima we will use the R package {cito}. The first thing we need to do is split the data into a training set and a testing set. We will use ~33% of the data for testing and ~67% for training.

Code
set.seed(12345)

index <- sample.int(nrow(quemin_bios_sel), 400) # 400 points, or ~33% of the data 

## Testing data
testing <- quemin_bios_sel[index, ] 

## Data for training our model
training <- quemin_bios_sel[-index, ]

Now let’s model the species-environment relationship or Grinnellian niches.

Code
library(cito)

Run Deep Neural Network

Code
set.seed(12345)

quemin_DNN <- dnn(Occurrence ~., # formula
            data = training, # data for training
            batchsize = 150L, 
            epochs = 100L,
            lr = 0.05, 
            #lambda = 0.001,
            #alpha = 1.0,
            validation = 0.33, 
            loss = "binomial", # model 
            #early_stopping = 15, 
            bootstrap = 20L, # number of bootstraps set to 20
            verbose = FALSE)

Visualize Network

Code
plot(quemin_DNN)

Visualize the bootstrap output. If an error is returned, you can skip this part.

Code
analyze_training(quemin_DNN)

Up to this point we have downloaded and processed occurrence data and fitted a deep learning model. Now let's predict the Grinnellian niche of Quercus minima in the accessible area, but first let's explore how good our model is. In addition, you can use the fitted object (i.e., quemin_DNN) to make predictions for other regions (e.g., areas with similar environmental conditions or invadable areas) or to project to other periods of time (i.e., the future or the past).

Code
quemin_DNN_pred <- predict(quemin_DNN, 
                      newdata = testing, 
                      reduce = "none")

dim(quemin_DNN_pred)
[1]  20 400   1

Let’s get the AUC (Area under the ROC Curve) from the prediction object.

Code
AUCs <- sapply(1:20, function(i) Metrics::auc(testing$Occurrence, quemin_DNN_pred[i,,]))

hist(AUCs,
     xlim = c(0.7, 1), 
     main = "AUC of bootstrap model", 
     xlab = "AUC")
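As a reminder of what the AUC measures — the probability that a randomly chosen presence receives a higher predicted value than a randomly chosen absence — here is a pairwise sketch on a few hypothetical predictions (toy values, not the model output above):

```r
# Hypothetical observations (1 = presence, 0 = absence) and predictions
obs <- c(1, 1, 0, 0, 0)
p   <- c(0.9, 0.4, 0.5, 0.2, 0.1)

# AUC = proportion of (presence, absence) pairs ranked correctly
# (ties count as half)
pos <- p[obs == 1]
neg <- p[obs == 0]

auc <- mean(outer(pos, neg, function(a, b) (a > b) + 0.5 * (a == b)))
auc # 5 of the 6 pairs are ordered correctly, so AUC = 5/6
```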

We can see that the AUC is very high, ranging between 0.980 and 0.984.

Get variable (feature) importance or the variables that are more important in predicting the Grinnellian niche of our species.

Code
quemin_Response <- summary(quemin_DNN)
quemin_Response
Summary of Deep Neural Network Model
── Feature Importance  
                              Importance Std.Err Z value Pr(>|z|)    
wc2.1_10m_bio_18 → Occurrence    0.13371 0.02378    5.62  1.9e-08 ***
wc2.1_10m_bio_4 → Occurrence     0.03211 0.01255    2.56   0.0105 *  
wc2.1_10m_bio_9 → Occurrence     0.00483 0.00284    1.70   0.0890 .  
wc2.1_10m_bio_5 → Occurrence     0.00420 0.00168    2.50   0.0123 *  
wc2.1_10m_bio_15 → Occurrence    0.01910 0.01031    1.85   0.0638 .  
wc2.1_10m_bio_2 → Occurrence     0.00704 0.00282    2.50   0.0125 *  
wc2.1_10m_bio_19 → Occurrence    0.00450 0.00165    2.73   0.0064 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
── Average Conditional Effects 
                                  ACE Std.Err Z value Pr(>|z|)    
wc2.1_10m_bio_18 → Occurrence  2.0150  0.2447    8.23  < 2e-16 ***
wc2.1_10m_bio_4 → Occurrence  -1.0124  0.2776   -3.65  0.00027 ***
wc2.1_10m_bio_9 → Occurrence   0.0996  0.2356    0.42  0.67233    
wc2.1_10m_bio_5 → Occurrence  -0.1402  0.1814   -0.77  0.43955    
wc2.1_10m_bio_15 → Occurrence  0.7558  0.2543    2.97  0.00296 ** 
wc2.1_10m_bio_2 → Occurrence  -0.3440  0.2093   -1.64  0.10027    
wc2.1_10m_bio_19 → Occurrence  0.0598  0.1677    0.36  0.72154    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
── Standard Deviation of Conditional Effects  
                                 ACE Std.Err Z value Pr(>|z|)    
wc2.1_10m_bio_18 → Occurrence 0.6512  0.1231    5.29  1.2e-07 ***
wc2.1_10m_bio_4 → Occurrence  0.3586  0.1203    2.98  0.00287 ** 
wc2.1_10m_bio_9 → Occurrence  0.3450  0.0965    3.58  0.00035 ***
wc2.1_10m_bio_5 → Occurrence  0.3645  0.1105    3.30  0.00097 ***
wc2.1_10m_bio_15 → Occurrence 0.4845  0.1217    3.98  6.8e-05 ***
wc2.1_10m_bio_2 → Occurrence  0.5683  0.1200    4.74  2.2e-06 ***
wc2.1_10m_bio_19 → Occurrence 0.4014  0.0887    4.53  6.0e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How do you feel about that?

Which variable is, in fact, not that important in predicting the Grinnellian niche of Quercus minima? Hint: look at the Feature Importance section in the object quemin_Response

6.1 Biological inferences

We can also go a step ahead and make some inferences. For example, we can see how the species is responding to the set of environmental variables we use to model its Grinnellian niche.

Code
ALE(model = quemin_DNN, variable = preds[1])

Code
ALE(model = quemin_DNN, variable = preds[2])

Code
ALE(model = quemin_DNN, variable = preds[3])

Code
ALE(model = quemin_DNN, variable = preds[4])

Code
ALE(model = quemin_DNN, variable = preds[5])

Code
ALE(model = quemin_DNN, variable = preds[6])

Code
ALE(model = quemin_DNN, variable = preds[7])

How do you feel about that?

Now let’s predict the Grinnellian niche for Quercus minima in the accessible area.

Auxiliary functions to perform the predictions, adapted from https://citoverse.github.io/cito/articles/C-Example_Species_distribution_modeling.html

Code
## mean prediction
customPredictFun <- function(model, data) {
  return(apply(predict(model, data, reduce = "none"), 2:3, mean)[, 1])
}

## standard deviation, i.e., the uncertainty associated with the prediction
customPredictFun_sd <- function(model, data) {
  return(apply(predict(model, data, reduce = "none"), 2:3, sd)[, 1])
} 
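The apply(..., 2:3, mean)[, 1] pattern may look cryptic: the prediction array has dimensions bootstraps × cells × outputs, so averaging over margins 2:3 collapses the bootstrap dimension, and [, 1] drops the single output column. A toy example:

```r
# Toy bootstrap predictions: 3 bootstraps x 4 cells x 1 output
p <- array(1:12, dim = c(3, 4, 1))

# Collapse the bootstrap dimension (margin 1) by keeping margins 2:3,
# then drop the single output column -- same pattern as customPredictFun()
m <- apply(p, 2:3, mean)[, 1]
m # one mean prediction per cell: 2 5 8 11

s <- apply(p, 2:3, sd)[, 1]
s # and one standard deviation per cell: 1 1 1 1
```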

Perform model prediction

Code
## Scale the predictors 
bios_quemin_sel_scale <- terra::scale(bios_quemin_sel)

## Predict the Grinnellian niche for Quercus minima
quemin_ENM <- terra::predict(object = bios_quemin_sel_scale,
                  model = quemin_DNN,
                  fun = customPredictFun, 
                  na.rm = TRUE)

Let’s try making a nice map.

Code
library(rasterVis)
Loading required package: lattice
Code
library(RColorBrewer)

mapTheme <- rasterTheme(region = rev(brewer.pal(11, "Spectral")),
  layout.widths = list(right.padding = 10),
  axis.line = list(col = "transparent"),
  tick = list(col = 'transparent'))

levelplot(quemin_ENM,
  maxpixels = 1e10,
  margin = FALSE,
  par.settings = mapTheme,
  scales = list(x = list(draw = TRUE),
                y = list(draw = TRUE)),
  zlim = c(0, 1))

We can also predict the uncertainty associated with our model.

Code
quemin_ENM_uncertainty <- terra::predict(object = bios_quemin_sel_scale,
                  model = quemin_DNN,
                  fun = customPredictFun_sd, 
                  na.rm = TRUE)

Plot the uncertainty in our prediction

Code
levelplot(quemin_ENM_uncertainty,
  maxpixels = 1e10,
  margin = FALSE,
  par.settings = mapTheme,
  scales = list(x = list(draw = TRUE),
                y = list(draw = TRUE)),
  zlim = c(0, 1))

You might also want to save the predictions.

Code
dir.create("results")
Warning in dir.create("results"): 'results' already exists
Code
dir.create("results/ENM")
Warning in dir.create("results/ENM"): 'results/ENM' already exists
Code
# save the predictions
writeRaster(quemin_ENM, 
            filename = "results/ENM/quemin_ENM.tif", 
            overwrite = TRUE)

# save the uncertainty prediction
writeRaster(quemin_ENM_uncertainty, 
            filename = "results/ENM/quemin_ENM_uncertainty.tif", 
            overwrite = TRUE)

Finally, let’s produce “shadows” from the “ghosts”.

7 Building Species Distributions

7.1 Species geographical distributions

First we need to set a binarization threshold for the model predictions. The threshold (cut-off) is used to transform model predictions (probabilities, distances, or similar values) into a binary score (presence or absence).

Code
## Get mean and sd predictions
meanPred <- apply(as.data.frame.array(quemin_DNN_pred), MARGIN = 2, FUN = mean)
sdPred <- apply(as.data.frame.array(quemin_DNN_pred), MARGIN = 2, FUN = sd)

## Create a prediction object
predOBJ <- ROCR::prediction(meanPred, testing$Occurrence)

## Get global AUC
quemin_AUC <- performance(predOBJ, "auc")@y.values[[1]]

## Plot AUC
ROCdf <- performance(predOBJ, "tpr", "fpr")

ROCdf <- data.frame(fpr = ROCdf@x.values[[1]], 
                    tpr = ROCdf@y.values[[1]])

ROCdf %>% 
  ggplot(aes(x = fpr, y = tpr)) + 
  geom_line(linewidth = 2, color = "darkgray") + 
  geom_abline(intercept = 0, slope = 1, color = 'red', linewidth = 2) + 
  ggtitle('Receiver-operator curve') + 
  xlab('False positive rate') + 
  ylab('True positive rate') + 
  theme_classic()

Here we will find the threshold for the mean prediction, but you could also find a threshold for each bootstrap. The threshold we will use maximizes the sensitivity (the ability of the model to correctly identify presences, i.e., the true positive rate) plus the specificity (the ability to correctly identify absences, i.e., the true negative rate) of the predictions.

Code
## Get sensitivity and specificity
quemin_PERF <- ROCR::performance(predOBJ, "sens", "spec") 

PERF_list <- (quemin_PERF@x.values[[1]] + quemin_PERF@y.values[[1]] - 1)

PERF_df <- data.frame(alpha = quemin_PERF@alpha.values[[1]], tss = PERF_list)

## Calculate Threshold
quemin_threshold <- min(PERF_df$alpha[which(PERF_df$tss == max(PERF_df$tss))])

## Calculate True skill statistics (TSS) 
quemin_TSS <- PERF_df[which(PERF_df$alpha == quemin_threshold), 'tss'] 
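Stripped of the ROCR machinery, the logic of the max(sensitivity + specificity) threshold can be sketched on a few hypothetical predictions (toy values, not the lab's model output):

```r
# Hypothetical presence-absence labels and model predictions
obs <- c(1, 1, 1, 0, 0, 0, 1, 0)
p   <- c(0.9, 0.8, 0.6, 0.55, 0.3, 0.2, 0.7, 0.4)

# Evaluate sensitivity + specificity - 1 (the TSS) at every candidate cut-off
thresholds <- sort(unique(p))

tss <- sapply(thresholds, function(t) {
  pred <- as.integer(p >= t)
  sens <- sum(pred == 1 & obs == 1) / sum(obs == 1) # true positive rate
  spec <- sum(pred == 0 & obs == 0) / sum(obs == 0) # true negative rate
  sens + spec - 1
})

best <- thresholds[which.max(tss)]
best # 0.6 separates the toy presences from the absences perfectly
```

Here the toy data are perfectly separable, so the best cut-off reaches TSS = 1; real data rarely do.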

The threshold that maximizes the sensitivity plus specificity is 0.39; we will use that value to produce “shadows” from the “ghosts”.

Code
quemin_SDM <- quemin_ENM

# reclassify our ensemble prediction
quemin_SDM[] <- ifelse(quemin_ENM[] >= quemin_threshold, 1, 0)

Plot the distribution of Quercus minima

Code
plot(quemin_SDM)
plot(countriesCoarse, add = TRUE)

That’s it, we modeled environmental niches (ghosts) to produce geographical distributions (shadows).

You might also want to save this final raster file in your hard drive.

Code
dir.create("results/SDM")
Warning in dir.create("results/SDM"): 'results/SDM' already exists
Code
writeRaster(quemin_SDM, 
            filename = "results/SDM/quemin_SDM.tif", 
            overwrite = TRUE)

8 The challenge

Please respond each question based on the practice.

  1. Do you think the observations are a reasonable representation of the distribution (and ecological niche) of the species? Explain your answer.

  2. Can we use ENM/SDM to aid the conservation of Quercus minima? Explain your answer.

  3. Based on the lecture and the practice, what can we conclude?

  4. Model the niche and predict the distribution of a different species (you can use one of the species listed at the beginning of the class).

The end, for now 😉!

References

Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J. R. G., Gruber, B., Lafourcade, B., Leitão, P. J., Münkemüller, T., McClean, C., Osborne, P. E., Reineking, B., Schröder, B., Skidmore, A. K., Zurell, D., & Lautenbach, S. (2013). Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27–46. https://doi.org/10.1111/j.1600-0587.2012.07348.x